Team Global project on Job Applications Data

1. Import Libraries

# uncomment and run the below line to install the required packages
#install.packages(c("gganimate", "png", "gifski"))
library(ggplot2)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ezids)
library(gganimate)
library(png)
library(gifski)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(maps)

2. Import Data

df = data.frame(read.csv('stackoverflow_full.csv', header = TRUE))
xkabledplyhead(df)
Head
Index Age Accessibility EdLevel Employment Gender MentalHealth MainBranch YearsCode YearsCodePro Country PreviousSalary HaveWorkedWith ComputerSkills Employed
0 <35 No Master 1 Man No Dev 7 4 Sweden 51552 C++;Python;Git;PostgreSQL 4 0
1 <35 No Undergraduate 1 Man No Dev 12 5 Spain 46482 Bash/Shell;HTML/CSS;JavaScript;Node.js;SQL;TypeScript;Git;Express;React.js;Vue.js;AWS;PostgreSQL 12 1
2 <35 No Master 1 Man No Dev 15 6 Germany 77290 C;C++;Java;Perl;Ruby;Git;Ruby on Rails 7 0
3 <35 No Undergraduate 1 Man No Dev 9 6 Canada 46135 Bash/Shell;HTML/CSS;JavaScript;PHP;Ruby;SQL;Git;jQuery;Laravel;Ruby on Rails;AWS;MySQL;PostgreSQL 13 0
4 >35 No PhD 0 Man No NotDev 40 30 Singapore 160932 C++;Python 2 0
#structure of a data frame 
str(df) 
## 'data.frame':    73462 obs. of  15 variables:
##  $ Index         : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Age           : chr  "<35" "<35" "<35" "<35" ...
##  $ Accessibility : chr  "No" "No" "No" "No" ...
##  $ EdLevel       : chr  "Master" "Undergraduate" "Master" "Undergraduate" ...
##  $ Employment    : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ Gender        : chr  "Man" "Man" "Man" "Man" ...
##  $ MentalHealth  : chr  "No" "No" "No" "No" ...
##  $ MainBranch    : chr  "Dev" "Dev" "Dev" "Dev" ...
##  $ YearsCode     : int  7 12 15 9 40 9 26 14 39 20 ...
##  $ YearsCodePro  : int  4 5 6 6 30 2 18 5 21 16 ...
##  $ Country       : chr  "Sweden" "Spain" "Germany" "Canada" ...
##  $ PreviousSalary: int  51552 46482 77290 46135 160932 38915 77831 81319 68507 37752 ...
##  $ HaveWorkedWith: chr  "C++;Python;Git;PostgreSQL" "Bash/Shell;HTML/CSS;JavaScript;Node.js;SQL;TypeScript;Git;Express;React.js;Vue.js;AWS;PostgreSQL" "C;C++;Java;Perl;Ruby;Git;Ruby on Rails" "Bash/Shell;HTML/CSS;JavaScript;PHP;Ruby;SQL;Git;jQuery;Laravel;Ruby on Rails;AWS;MySQL;PostgreSQL" ...
##  $ ComputerSkills: int  4 12 7 13 2 5 17 4 3 6 ...
##  $ Employed      : int  0 1 0 0 0 0 1 0 0 0 ...

The dataframe has 73462 records. Index, Employment, YearsCode, YearsCodePro, PreviousSalary, ComputerSkills and Employed are of integer type, and the rest are character columns. Going ahead, we will need to convert some of them to factors.

3. Data Preprocessing & Cleaning

Let’s first check for missing values and duplicate values if any.

# Check the missing values and count them in each column
print(colSums(is.na(df)))
##          Index            Age  Accessibility        EdLevel     Employment 
##              0              0              0              0              0 
##         Gender   MentalHealth     MainBranch      YearsCode   YearsCodePro 
##              0              0              0              0              0 
##        Country PreviousSalary HaveWorkedWith ComputerSkills       Employed 
##              0              0              0              0              0

Here we check the number of unique values in the “Index” column of the dataframe “df” to identify duplicates; if it equals the number of rows, there are none.

# Check duplicates
length(unique(df$Index))
## [1] 73462
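An equivalent check (a small sketch, not part of the original pipeline) uses `anyDuplicated()`, which returns 0 when a vector has no repeated values:

```r
# A helper equivalent to the length(unique(...)) check: a column is a valid
# unique key exactly when anyDuplicated() finds no repeats.
is_unique_key <- function(x) anyDuplicated(x) == 0

is_unique_key(c(1, 2, 3))   # TRUE: no repeats
is_unique_key(c(1, 2, 2))   # FALSE: 2 appears twice
# on the full data: is_unique_key(df$Index)
```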

Next we compute basic dataset statistics, including row and column counts, the number of discrete and continuous columns, and memory allocation information.

# Getting the number of rows and columns
num_rows <- dim(df)[1]
num_columns <- dim(df)[2]

# Getting the number of discrete and continuous columns
discrete_columns <- sum(sapply(df, is.factor))
continuous_columns <- sum(sapply(df, is.numeric))

# Checking for missing values
missing_columns <- sum(colSums(is.na(df)) > 0)
missing_observations <- sum(rowSums(is.na(df)) > 0)

# Getting the number of complete rows
complete_rows <- sum(complete.cases(df))

# Calculating the total number of observations
total_observations <- num_rows * num_columns

# Memory allocation
memory_allocation <- format(object.size(df), units = "Mb")

# Displaying the information
cat("Basic Statistics and Raw Counts for the Dataset:\n")
## Basic Statistics and Raw Counts for the Dataset:
cat("Rows:", num_rows, "\n")
## Rows: 73462
cat("Columns:", num_columns, "\n")
## Columns: 15
cat("Discrete columns:", discrete_columns, "\n")
## Discrete columns: 0
cat("Continuous columns:", continuous_columns, "\n")
## Continuous columns: 7
cat("All missing columns:", missing_columns, "\n")
## All missing columns: 0
cat("Missing observations:", missing_observations, "\n")
## Missing observations: 0
cat("Complete Rows:", complete_rows, "\n")
## Complete Rows: 73462
cat("Total observations:", total_observations, "\n")
## Total observations: 1101930
cat("Memory allocation:", memory_allocation, "\n")
## Memory allocation: 18.4 Mb

We then visualize these raw counts as a percentage breakdown in a bar chart.

num_rows <- dim(df)[1]
num_columns <- dim(df)[2]

discrete_columns <- sum(sapply(df, is.factor))
continuous_columns <- sum(sapply(df, is.numeric))

missing_columns <- sum(colSums(is.na(df)) > 0)
missing_observations <- sum(rowSums(is.na(df)) > 0)

complete_rows <- sum(complete.cases(df))

total_observations <- num_rows * num_columns

statistics <- data.frame(
  Statistic = c("Rows", "Columns", "Discrete Columns", "Missing Observations", "Total Observations"),
  Count = c(num_rows, num_columns, discrete_columns, missing_observations, total_observations)
)

statistics$Percentage <- round((statistics$Count / sum(statistics$Count)) * 100, 2)

bar_plot <- ggplot(statistics, aes(x = Statistic, y = Percentage)) +
  geom_bar(stat = "identity", width = 0.7, fill = "skyblue") +
  geom_text(aes(label = paste0(Percentage, "%")), vjust = -0.5, size = 4) +
  labs(title = "Percentage Breakdown of Dataset Statistics",
       x = "Statistic", y = "Percentage") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(bar_plot)

We notice that there are no missing values or duplicates in the data, so we have all the observations to work with.

Now that we know this, let us do some cleaning and feature engineering. First we’ll extract the numeric columns.

# Extract the numeric columns
head(subset(df, select = c(names(df)[sapply(df, is.numeric)])))
##   Index Employment YearsCode YearsCodePro PreviousSalary ComputerSkills
## 1     0          1         7            4          51552              4
## 2     1          1        12            5          46482             12
## 3     2          1        15            6          77290              7
## 4     3          1         9            6          46135             13
## 5     4          0        40           30         160932              2
## 6     5          1         9            2          38915              5
##   Employed
## 1        0
## 2        1
## 3        0
## 4        0
## 5        0
## 6        0
  • The Index column will not be required, so we will drop it (we can use the built-in row index of a dataframe).
  • We will convert Employment into a factor and relabel its values for readability.
  • We will do the same for Employed.
df_clean <- subset(df, select = -c(Index))
df_clean$Employment = as.character(df_clean$Employment)
df_clean$Employment[df_clean$Employment == "1"] <- "currently_employed"
df_clean$Employment[df_clean$Employment=="0"] <- "not_currently_employed"
df_clean$Employment = factor(df_clean$Employment)
df_clean$Employed = as.character(df_clean$Employed)
df_clean$Employed[df_clean$Employed == "1"] <- "hired"
df_clean$Employed[df_clean$Employed == "0"] <- "not_hired"
df_clean$Employed = factor(df_clean$Employed)
str(df_clean)
## 'data.frame':    73462 obs. of  14 variables:
##  $ Age           : chr  "<35" "<35" "<35" "<35" ...
##  $ Accessibility : chr  "No" "No" "No" "No" ...
##  $ EdLevel       : chr  "Master" "Undergraduate" "Master" "Undergraduate" ...
##  $ Employment    : Factor w/ 2 levels "currently_employed",..: 1 1 1 1 2 1 1 1 1 1 ...
##  $ Gender        : chr  "Man" "Man" "Man" "Man" ...
##  $ MentalHealth  : chr  "No" "No" "No" "No" ...
##  $ MainBranch    : chr  "Dev" "Dev" "Dev" "Dev" ...
##  $ YearsCode     : int  7 12 15 9 40 9 26 14 39 20 ...
##  $ YearsCodePro  : int  4 5 6 6 30 2 18 5 21 16 ...
##  $ Country       : chr  "Sweden" "Spain" "Germany" "Canada" ...
##  $ PreviousSalary: int  51552 46482 77290 46135 160932 38915 77831 81319 68507 37752 ...
##  $ HaveWorkedWith: chr  "C++;Python;Git;PostgreSQL" "Bash/Shell;HTML/CSS;JavaScript;Node.js;SQL;TypeScript;Git;Express;React.js;Vue.js;AWS;PostgreSQL" "C;C++;Java;Perl;Ruby;Git;Ruby on Rails" "Bash/Shell;HTML/CSS;JavaScript;PHP;Ruby;SQL;Git;jQuery;Laravel;Ruby on Rails;AWS;MySQL;PostgreSQL" ...
##  $ ComputerSkills: int  4 12 7 13 2 5 17 4 3 6 ...
##  $ Employed      : Factor w/ 2 levels "hired","not_hired": 2 1 2 2 2 2 1 2 2 2 ...

Now let’s take a look at the character columns.

# Extract the character columns
head(subset(df_clean, select = c(names(df)[sapply(df, is.character)])))
##   Age Accessibility       EdLevel Gender MentalHealth MainBranch   Country
## 1 <35            No        Master    Man           No        Dev    Sweden
## 2 <35            No Undergraduate    Man           No        Dev     Spain
## 3 <35            No        Master    Man           No        Dev   Germany
## 4 <35            No Undergraduate    Man           No        Dev    Canada
## 5 >35            No           PhD    Man           No     NotDev Singapore
## 6 <35            No        Master    Man           No        Dev    France
##                                                                                      HaveWorkedWith
## 1                                                                         C++;Python;Git;PostgreSQL
## 2  Bash/Shell;HTML/CSS;JavaScript;Node.js;SQL;TypeScript;Git;Express;React.js;Vue.js;AWS;PostgreSQL
## 3                                                            C;C++;Java;Perl;Ruby;Git;Ruby on Rails
## 4 Bash/Shell;HTML/CSS;JavaScript;PHP;Ruby;SQL;Git;jQuery;Laravel;Ruby on Rails;AWS;MySQL;PostgreSQL
## 5                                                                                        C++;Python
## 6                                                                JavaScript;Python;Docker;Git;MySQL
  • Except Country and HaveWorkedWith, the rest are clearly categorical features, so we will convert them to the factor data type.
  • EdLevel is an ordinal feature, so we encode it as an ordered factor.
df_clean$Age = factor(df_clean$Age)
df_clean$Accessibility = factor(df_clean$Accessibility)
df_clean$EdLevel = factor(df_clean$EdLevel, ordered = T, 
                          levels = c("Other", "NoHigherEd", "Undergraduate", "Master", "PhD"))
df_clean$Gender = factor(df_clean$Gender)
df_clean$MentalHealth = factor(df_clean$MentalHealth)
df_clean$MainBranch = factor(df_clean$MainBranch)
str(df_clean)
## 'data.frame':    73462 obs. of  14 variables:
##  $ Age           : Factor w/ 2 levels "<35",">35": 1 1 1 1 2 1 2 1 2 2 ...
##  $ Accessibility : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ EdLevel       : Ord.factor w/ 5 levels "Other"<"NoHigherEd"<..: 4 3 4 3 5 4 4 4 3 4 ...
##  $ Employment    : Factor w/ 2 levels "currently_employed",..: 1 1 1 1 2 1 1 1 1 1 ...
##  $ Gender        : Factor w/ 3 levels "Man","NonBinary",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ MentalHealth  : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ MainBranch    : Factor w/ 2 levels "Dev","NotDev": 1 1 1 1 2 1 1 2 1 1 ...
##  $ YearsCode     : int  7 12 15 9 40 9 26 14 39 20 ...
##  $ YearsCodePro  : int  4 5 6 6 30 2 18 5 21 16 ...
##  $ Country       : chr  "Sweden" "Spain" "Germany" "Canada" ...
##  $ PreviousSalary: int  51552 46482 77290 46135 160932 38915 77831 81319 68507 37752 ...
##  $ HaveWorkedWith: chr  "C++;Python;Git;PostgreSQL" "Bash/Shell;HTML/CSS;JavaScript;Node.js;SQL;TypeScript;Git;Express;React.js;Vue.js;AWS;PostgreSQL" "C;C++;Java;Perl;Ruby;Git;Ruby on Rails" "Bash/Shell;HTML/CSS;JavaScript;PHP;Ruby;SQL;Git;jQuery;Laravel;Ruby on Rails;AWS;MySQL;PostgreSQL" ...
##  $ ComputerSkills: int  4 12 7 13 2 5 17 4 3 6 ...
##  $ Employed      : Factor w/ 2 levels "hired","not_hired": 2 1 2 2 2 2 1 2 2 2 ...

“Other” is set as the lowest level for now. Ideally it would be excluded from the ranking of levels, but about 10000 observations carry “Other”, so we cannot afford to drop that many rows.
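If a strictly ordinal ranking is ever required, one hypothetical alternative (a sketch, not applied here) is to leave “Other” out of the level set, so those observations become NA in the ordered factor while their rows remain in the data:

```r
# "Other" is omitted from the levels, so it becomes NA instead of ranking lowest
ed_strict <- factor(c("Other", "Master", "PhD", "Undergraduate"),
                    levels = c("NoHigherEd", "Undergraduate", "Master", "PhD"),
                    ordered = TRUE)
is.na(ed_strict)   # TRUE only for the "Other" entry; the row itself is kept
```

Ordinal tests then simply skip the NA entries instead of ranking “Other” below every real education level.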

Let’s check the number of different countries we have.

length(unique(df_clean$Country))
## [1] 172

There are 172 different countries; we leave this column as-is for the time being.
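To see which countries dominate, a quick tally helps (a sketch; on the real data the call would be `df_clean %>% count(Country, sort = TRUE)`):

```r
library(dplyr)

# count() groups, tallies, and (with sort = TRUE) orders in one step;
# toy data stands in for df_clean here
toy <- data.frame(Country = c("USA", "USA", "India", "USA", "India", "Germany"))
toy %>% count(Country, sort = TRUE)
#   Country n
# 1     USA 3
# 2   India 2
# 3 Germany 1
```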

Let’s see if there are records where professional coding experience exceeds total coding experience. Such records are logically impossible, so we have to remove them.

nrow(subset(df_clean, subset = df_clean$YearsCode < df_clean$YearsCodePro))
## [1] 588
# We observe 588 such faulty records; keep only rows where total coding
# experience is at least the professional coding experience
df_clean = subset(df_clean, subset = df_clean$YearsCode >= df_clean$YearsCodePro)

Let’s take a look at some plots and distribution in order to deal with outliers.

# Create a list of the columns that we use in combined boxplot
columns_to_plot <- c("YearsCode", "YearsCodePro", "PreviousSalary", "ComputerSkills")

# Set up the plotting area to display all boxplots together
par(mfrow = c(2, length(columns_to_plot)))

# Display the boxplots
for (i in 1:length(columns_to_plot)) {
  boxplot(df_clean[[columns_to_plot[i]]], main = columns_to_plot[i],col = "lightblue")
}
for (i in 1:length(columns_to_plot)) {
  hist(df_clean[[columns_to_plot[i]]], main = columns_to_plot[i], xlab = "", col = "lightblue")
}

# plotting area layout
par(mfrow = c(2, 1))

We notice that all of these are right-skewed distributions with quite a few outliers, all on the higher end. Let’s compute the critical values beyond which points count as outliers.

o_code= quantile(df_clean$YearsCode, probs = 0.75, na.rm = FALSE) + 
  (1.5 * IQR(df_clean$YearsCode))
o_procode = quantile(df_clean$YearsCodePro, probs = 0.75, na.rm = FALSE) + 
  (1.5 * IQR(df_clean$YearsCodePro))
o_presal = quantile(df_clean$PreviousSalary, probs = 0.75, na.rm = FALSE) + 
  (1.5 * IQR(df_clean$PreviousSalary))
o_cskill = quantile(df_clean$ComputerSkills, probs = 0.75, na.rm = FALSE) + 
  (1.5 * IQR(df_clean$ComputerSkills))

print(paste(o_code, o_procode, o_presal, o_cskill))
## [1] "39.5 25.5 193596.875 30.5"

So, as per the IQR method (Q3 + 1.5 × IQR), the outlier thresholds are:

  • 39.5 years of coding experience
  • 25.5 years of professional coding experience
  • 193596 as the previous salary
  • 30.5 as the number of computer skills

Someone can plausibly have more than 25.5 years of professional coding experience, so it doesn’t make sense to remove everything above that value. A more reasonable cutoff is 40 years, so we treat anyone with 40 or more years of coding or professional coding experience as an outlier and remove those observations.

31 computer skills is a lot for some roles and not so many for others, so we use a higher cutoff and remove observations with 35 or more computer skills.

193k seems a plausible critical value, but since we do not know the currency of the salaries, we keep the higher salary values: extremely high previous salaries are not implausible in the real world.
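The four quantile expressions above follow the same pattern and can be collapsed into one helper (a sketch of the same Q3 + 1.5 × IQR rule):

```r
# upper Tukey fence: values above Q3 + 1.5 * IQR are flagged as outliers
iqr_upper_fence <- function(x) {
  unname(quantile(x, probs = 0.75, na.rm = TRUE) + 1.5 * IQR(x, na.rm = TRUE))
}

iqr_upper_fence(c(1, 2, 3, 4, 100))   # 7, so 100 lies beyond the fence
# on the real data:
# sapply(df_clean[c("YearsCode", "YearsCodePro", "PreviousSalary", "ComputerSkills")],
#        iqr_upper_fence)
```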

df_clean = subset(df_clean, subset = df_clean$YearsCode < 40)
df_clean = subset(df_clean, subset = df_clean$YearsCodePro < 40)
df_clean = subset(df_clean, subset = df_clean$ComputerSkills < 35)

Let’s take a look at the data now:

df_final <- df_clean
head(df_final)
##   Age Accessibility       EdLevel         Employment Gender MentalHealth
## 1 <35            No        Master currently_employed    Man           No
## 2 <35            No Undergraduate currently_employed    Man           No
## 3 <35            No        Master currently_employed    Man           No
## 4 <35            No Undergraduate currently_employed    Man           No
## 6 <35            No        Master currently_employed    Man           No
## 7 >35            No        Master currently_employed    Man           No
##   MainBranch YearsCode YearsCodePro Country PreviousSalary
## 1        Dev         7            4  Sweden          51552
## 2        Dev        12            5   Spain          46482
## 3        Dev        15            6 Germany          77290
## 4        Dev         9            6  Canada          46135
## 6        Dev         9            2  France          38915
## 7        Dev        26           18 Germany          77831
##                                                                                                                       HaveWorkedWith
## 1                                                                                                          C++;Python;Git;PostgreSQL
## 2                                   Bash/Shell;HTML/CSS;JavaScript;Node.js;SQL;TypeScript;Git;Express;React.js;Vue.js;AWS;PostgreSQL
## 3                                                                                             C;C++;Java;Perl;Ruby;Git;Ruby on Rails
## 4                                  Bash/Shell;HTML/CSS;JavaScript;PHP;Ruby;SQL;Git;jQuery;Laravel;Ruby on Rails;AWS;MySQL;PostgreSQL
## 6                                                                                                 JavaScript;Python;Docker;Git;MySQL
## 7 C++;HTML/CSS;Java;JavaScript;Kotlin;Node.js;TypeScript;Docker;Git;Kubernetes;Angular;Express;Spring;AWS;Heroku;DynamoDB;PostgreSQL
##   ComputerSkills  Employed
## 1              4 not_hired
## 2             12     hired
## 3              7 not_hired
## 4             13 not_hired
## 6              5 not_hired
## 7             17     hired
print(paste("Number of records in final data:", nrow(df_final)))
## [1] "Number of records in final data: 65906"

4. EDA

Here the DataExplorer::create_report(df) function generates a comprehensive exploratory data analysis (EDA) report for a given dataset (df). This report includes summary statistics, data visualizations, missing value analysis, and other insights to help users understand and preprocess their data effectively.

library(DataExplorer)
#DataExplorer::create_report(df)

Here we identify the numeric columns in the dataset, create a directory to save histogram plots, and generate a histogram for each numeric column. We use custom colors for the histograms and save them as individual PNG files in the specified directory to visualize each distribution.

#names of numeric columns
numeric_cols <- sapply(df, is.numeric)
numeric_cols <- names(df)[numeric_cols]

# Creating a directory to save the plots
output_dir <- "numerical_histograms"
dir.create(output_dir, showWarnings = FALSE)


custom_colors <- c("skyblue", "pink", "orange", "purple", "green")

# Looping through the numerical columns and create histograms
for (col in numeric_cols) {
  color_index <- match(col, numeric_cols) %% length(custom_colors) + 1
  fill_color <- custom_colors[color_index]

  # Creating a histogram
  hist_plot <- ggplot(df, aes(x = .data[[col]]) ) +
    geom_histogram(fill = fill_color, color = "darkblue", bins = 20) +
    labs(title = paste("Histogram of", col),
         x = col,
         y = "Frequency") +
    theme_minimal()

  
  hist_output_file <- file.path(output_dir, paste0("histogram_", col, ".png"))
  ggsave(filename = hist_output_file, plot = hist_plot, width = 6, height = 4)

  print(hist_plot)
}

Here we created a map plot of data from the “df_final” dataset, where countries are identified by “Country,” and “Age” values are represented by colors on the map.

# Creating a map of responses by country
world_map <- map_data("world")
ggplot(data = df_final, aes(map_id = Country)) +
  geom_map(aes(fill = Age), map = world_map) +
  expand_limits(x = world_map$long, y = world_map$lat)

Here we created a world map plot using data from the “df_final” dataset. Countries are mapped by “Country,” and “MentalHealth” values are visualized using fill colors.

ggplot(data = df_final, aes(map_id = Country)) +
  geom_map(aes(fill = MentalHealth), map = world_map) +
  expand_limits(x = world_map$long, y = world_map$lat)

Here we created a world map plot with data from the “df_final” dataset. “Country” identifies countries, and “EdLevel” values are represented with fill colors on the map.

ggplot(data = df_final, aes(map_id = Country)) +
  geom_map(aes(fill = EdLevel), map = world_map) +
  expand_limits(x = world_map$long, y = world_map$lat)

Here we generated a world map plot based on data from the “df_final” dataset. The map represents countries by “Country,” and “Employment” values are depicted using fill colors on the map.

ggplot(data = df_final, aes(map_id = Country)) +
  geom_map(aes(fill = Employment), map = world_map) +
  expand_limits(x = world_map$long, y = world_map$lat)

ggplot(data = subset(df_final, subset = Employed=='hired'), aes(EdLevel, after_stat(count))) + 
  geom_bar(aes(fill = Gender), position = 'dodge', alpha = 0.5) +
  labs(title = "Education Level by gender for employed workers", x="Education Level", y="Count")

ggplot(data = df_final, aes(EdLevel)) + 
  geom_bar(aes(fill = Employment), position = 'dodge', alpha = 0.5) +
  labs(title = "Education Level by Employment", x="Education Level", y="Count")

ggplot(data = df_final, aes(MainBranch)) + 
  geom_bar(aes(fill = Age), position = 'dodge', alpha = 0.5) +
  labs(title = "MainBranch by Age", x="MainBranch", y="Count")

ggplot(data = df_final, aes(x = PreviousSalary, fill = MentalHealth)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density plot of Previous Salary by Mental Health", x="Previous Salary", y= 'Density')

ggplot(data = df_final, aes(x = MentalHealth, y = PreviousSalary)) +
  geom_jitter(alpha = 0.5, col = 'navy') +
  labs(title = "Previous Salary by Mental Health", x = 'Mental Health', y="Previous Salary")

plotdata <- df_final %>%
  group_by(MentalHealth) %>%
  summarize(mean_prevsalary = mean(PreviousSalary))

ggplot(data = plotdata, aes(x = MentalHealth, y = mean_prevsalary)) +
  geom_bar(stat = 'identity', alpha = 0.5, aes(fill = MentalHealth)) +
  labs(title = "Mean Previous Salary by Mental Health", x = 'Mental Health', y="Previous Salary")

Here we split the “HaveWorkedWith” column on “;” in the df_final dataset, group the resulting rows by skill, count each skill’s occurrences, arrange the counts in descending order, and store the result in the skills_df data frame. The top 10 most common skills are shown below.

df_final %>%
separate_rows(HaveWorkedWith, sep = ";") %>% 
group_by(HaveWorkedWith) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) -> skills_df
head(skills_df, 10)
## # A tibble: 10 × 2
##    HaveWorkedWith Count
##    <chr>          <int>
##  1 JavaScript     44256
##  2 Docker         36386
##  3 HTML/CSS       35992
##  4 SQL            34096
##  5 Git            32828
##  6 AWS            28039
##  7 Python         27983
##  8 PostgreSQL     26990
##  9 MySQL          26079
## 10 TypeScript     24800

We see that JavaScript, Docker and HTML/CSS are the skills known by the majority of applicants.
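Raw counts are easier to interpret as a share of applicants; `add_share` below is a hypothetical helper (a sketch), and on the real data the call would be `add_share(skills_df, nrow(df_final))`. Note that each applicant lists several skills, so the shares do not sum to 100%:

```r
library(dplyr)

# hypothetical helper: express skill counts as a percentage of applicants
add_share <- function(counts, n_applicants) {
  counts %>% mutate(Share = round(100 * Count / n_applicants, 1))
}

# numbers taken from the table above: 44256 JavaScript users of 65906 applicants
add_share(data.frame(HaveWorkedWith = c("JavaScript", "Docker"),
                     Count = c(44256, 36386)),
          n_applicants = 65906)   # Shares: 67.2 and 55.2
```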

plot <- ggplot(data = head(skills_df, 10), aes(x = reorder(HaveWorkedWith, Count), y = Count)) + 
  geom_bar(stat = 'identity', aes(fill = HaveWorkedWith), alpha = 0.6) + 
  coord_flip() + 
  theme(axis.text.x = element_blank(),
        axis.line = element_blank(),
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.x = element_line(linewidth = 0.1, color = "grey"),
        panel.grid.minor.x = element_line(linewidth = 0.1, color = "grey"),
        plot.background = element_blank()) + 
  geom_text(aes(y = Count, label = Count)) +
  labs(title = "Skills used by number of applicants", 
       x = "Skill", y = "No. of applicants possessing the skill")
anim <- plot + 
    transition_states(HaveWorkedWith, transition_length = 4, wrap = FALSE) + 
  shadow_mark() + 
  enter_grow() +
  enter_fade() + 
  ease_aes('sine-in')
  
animate(anim)

#anim_save("Skills.gif", anim)

Now let’s see which skills are associated with the most total earnings.

df_final %>%
separate_rows(HaveWorkedWith, sep = ";") %>% 
group_by(HaveWorkedWith) -> skills_df2

aggregate(skills_df2$PreviousSalary, by = list(skills_df2$HaveWorkedWith), FUN = sum) %>%
arrange(desc(x)) -> skills_df2
colnames(skills_df2) <- c('Skills', 'TotalEarning')


plot <- ggplot(data = head(skills_df2, 10), aes(x = reorder(Skills, TotalEarning), y = TotalEarning)) + 
  geom_bar(stat = 'identity', aes(fill = Skills), alpha = 0.6) + 
  coord_flip() + 
  theme(axis.text.x = element_blank(),
        axis.line = element_blank(),
        axis.ticks = element_blank(),
        panel.background = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.x = element_line(linewidth = 0.1, color = "grey"),
        panel.grid.minor.x = element_line(linewidth = 0.1, color = "grey"),
        plot.background = element_blank()) + 
  geom_text(aes(y = TotalEarning, label = paste(round(TotalEarning/1000000, 2), 'M'))) +
  labs(title = "Top 10 earning skills", 
       x = "Skill", y = "Total earnings")

anim <- plot + 
    transition_states(Skills, transition_length = 4, wrap = FALSE) + 
  shadow_mark() + 
  enter_grow() +
  enter_fade() + 
  ease_aes('sine-in')
  
animate(anim)

#anim_save("SkillsEarnings.gif", anim)


5. Hypothesis Testing

Test 1: Is there a significant difference between employed males and non-males in their respective education levels?

We first want to check whether there is a difference in the number of males vs non-males employed at every level of education.

Because we have over 65,000 data points, we may use statistical methods that apply to data satisfying normality conditions (because of the central limit theorem).

Hence, we use a series of two-sample z tests to conduct our first set of hypothesis tests.

The assumptions we need to satisfy are:

  • Large sample size
  • Data are collected as a simple random sample (so that observations are independent of one another)

We first want to look at individuals with a bachelor’s degree.

\(H_0:\) There is no difference in the proportion of undergraduate males who are employed when compared to non-males.
\(H_a:\) There is a significant difference in the proportion of undergraduate males who are employed when compared to non-males.

# We first subset into the different education levels, here undergraduate
undergrad <- subset(df_clean, df_clean$EdLevel == "Undergraduate")


# For the two sample z test, we need our two samples, males and nonmales
undergrad_male <- subset(undergrad, undergrad$Gender == "Man") # Sample 1, males

undergrad_male_employed <- subset(undergrad_male, undergrad_male$Employment == "currently_employed")

undergrad_nonmales <- subset(undergrad, undergrad$Gender != "Man") # Sample 2, females and nonbinary

undergrad_nonmales_employed <- (subset(undergrad_nonmales, undergrad_nonmales$Employment == "currently_employed"))


undergrad_employed_ztest <- prop.test(c(nrow(undergrad_male_employed), nrow(undergrad_nonmales_employed)), c(nrow(undergrad_male),nrow(undergrad_nonmales)))
undergrad_employed_ztest
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(nrow(undergrad_male_employed), nrow(undergrad_nonmales_employed)) out of c(nrow(undergrad_male), nrow(undergrad_nonmales))
## X-squared = 1.0964, df = 1, p-value = 0.2951
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.018869581  0.005340031
## sample estimates:
##    prop 1    prop 2 
## 0.9071626 0.9139273

The proportion of undergraduate males who are employed is 28611/31539 ≈ 0.907.
The proportion of undergraduate non-males who are employed is 2113/2312 ≈ 0.914.
Our two sample z test statistic is 1.0964316, with an associated p-value of 0.2950506.
At an alpha-level of .05, we fail to reject the null hypothesis and conclude that there is not a significant difference in the proportion of males vs non-males who have a bachelor’s degree and are employed.
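Since the same subset-and-test procedure repeats for each education level, it can be wrapped in one helper (a sketch assuming the factor levels created during cleaning), which also removes the risk of mixing up subsets between levels:

```r
# two-sample z test (prop.test) of employment rates, males vs non-males,
# within a single education level
employment_ztest <- function(data, edlevel) {
  grp     <- subset(data, data$EdLevel == edlevel)
  male    <- subset(grp, grp$Gender == "Man")
  nonmale <- subset(grp, grp$Gender != "Man")
  n_employed <- function(d) sum(d$Employment == "currently_employed")
  prop.test(c(n_employed(male), n_employed(nonmale)),
            c(nrow(male), nrow(nonmale)))
}
# usage on the real data: employment_ztest(df_clean, "Undergraduate")
```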

We now look at individuals with a master’s degree.

\(H_0:\) There is no difference in the proportion of males with a master’s degree who are employed when compared to non-males.
\(H_a:\) There is a significant difference in the proportion of males with a master’s degree who are employed when compared to non-males.

# We first subset into the different education levels, here masters
masters <- subset(df_clean, df_clean$EdLevel == "Master")


# For the two sample z test, we need our two samples, males and nonmales
masters_male <- subset(masters, masters$Gender == "Man") # Sample 1, males

masters_male_employed <- subset(masters_male, masters_male$Employment == "currently_employed")


masters_nonmales <- subset(masters, masters$Gender != "Man") # Sample 2, females and nonbinary

masters_nonmales_employed <- (subset(masters_nonmales, masters_nonmales$Employment == "currently_employed"))


masters_employed_ztest <- prop.test(c(nrow(masters_male_employed), nrow(masters_nonmales_employed)), c(nrow(masters_male),nrow(masters_nonmales)))
masters_employed_ztest
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(nrow(masters_male_employed), nrow(masters_nonmales_employed)) out of c(nrow(masters_male), nrow(masters_nonmales))
## X-squared = 2755.3, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.4187530 0.4609501
## sample estimates:
##    prop 1    prop 2 
## 0.8844882 0.4446367

The proportion of males with a master’s who are employed is 14181/16033 ≈ 0.884.
The proportion of non-males with a master’s who are employed is 1028/2312 ≈ 0.445.
Our two sample z test statistic is 2755.2887877, with an associated p-value below 2.2e-16.
At an alpha-level of .05, we reject the null hypothesis and conclude that there is a significant difference in the proportion of males vs non-males who have a master’s degree and are employed.

Finally, we examine individuals who have a doctorate degree.

\(H_0:\) There is no difference in the proportion of males with a PhD who are employed when compared to non-males.
\(H_a:\) There is a significant difference in the proportion of males with a PhD who are employed when compared to non-males.

# We first subset into the different education levels, here PhD
phd <- subset(df_clean, df_clean$EdLevel == "PhD")


# For the two sample z test, we need our two samples, males and nonmales
phd_male <- subset(phd, phd$Gender == "Man") # Sample 1, males

phd_male_employed <- subset(phd_male, phd_male$Employment == "currently_employed")

phd_nonmales <- subset(phd, phd$Gender != "Man") # Sample 2, females and nonbinary

phd_nonmales_employed <- (subset(phd_nonmales, phd_nonmales$Employment == "currently_employed"))


phd_employed_ztest <- prop.test(c(nrow(phd_male_employed), nrow(phd_nonmales_employed)), c(nrow(phd_male),nrow(phd_nonmales)))
phd_employed_ztest
## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(nrow(phd_male_employed), nrow(phd_nonmales_employed)) out of c(nrow(phd_male), nrow(phd_nonmales))
## X-squared = 0.1364, df = 1, p-value = 0.7119
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.03799399  0.06073721
## sample estimates:
##    prop 1    prop 2 
## 0.9063992 0.8950276

The proportion of males with a PhD who are employed is: 1898/2094 ≈ 0.906
The proportion of nonmales with a PhD who are employed is: 162/181 ≈ 0.895
Our two sample z test statistic is 0.1364029, which has an associated p-value of 0.7118835.
At an alpha-level of .05 we fail to reject the null hypothesis and conclude that there is not a significant difference in the proportion of males vs. nonmales with a doctorate degree who are employed.

In real-world scenarios, these results might suggest that employment outcomes differ by gender at the master’s level but not at the doctorate level. This could reflect societal or industry biases, or possibly differing educational or career choices among genders. Companies keen on diversity might want to address such disparities in their recruitment and mentorship programs, ensuring opportunities aren’t skewed by gender-based educational trends.

Test 2: Does education level significantly impact employment in the technical field?

Since the number of known computer skills is continuous and we have more than two categories for level of education, we conduct an ANOVA test to answer this question.

The assumptions we need to satisfy are: - All 5 groups are normally distributed
- Homogeneity of variances
- Error terms are independent of each other
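The homogeneity-of-variances assumption can be probed with Bartlett’s test before running the ANOVA. A minimal sketch on simulated data (the group names, means, and sample sizes here are illustrative, not taken from the dataset):

```r
# Sketch: Bartlett's test of equal variances across education groups (toy data)
set.seed(7)
toy <- data.frame(
  EdLevel        = rep(c("Undergraduate", "Master", "PhD"), each = 50),
  ComputerSkills = c(rnorm(50, 13, 4), rnorm(50, 12, 4), rnorm(50, 10, 4))
)
bt <- bartlett.test(ComputerSkills ~ EdLevel, data = toy)
bt$p.value   # a large p-value gives no evidence against equal variances
```

Running the same call on `df_clean` would test the assumption on the real groups.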

Let’s first visualize the question we are trying to answer by examining the boxplots for distribution of computer skills at all education levels.

loadPkg("ggplot2")
ggplot(df_clean, aes(x=EdLevel, y=ComputerSkills)) + 
  geom_boxplot( colour=c("#ff0000","#FFFF00", "#11cc11","#0000ff","#ff00ff"), alpha = .5, outlier.shape=8, outlier.size=4) +
  labs(x="Education Level", y = "Number of Computer Skills")

There seem to be at least some differences, and we solidify this by conducting our hypothesis test using ANOVA.

\(H_0:\) There is no significant difference in the average number of computer skills acquired when comparing all individuals at every level of education. \(H_a:\) At least one education level group differs from the others in the average number of computer skills known.

anovaRes = aov(ComputerSkills ~ EdLevel, data=df_clean)
anovaSummary = summary(anovaRes)
anovaSummary
##                Df  Sum Sq Mean Sq F value Pr(>F)    
## EdLevel         4   45564   11391   281.8 <2e-16 ***
## Residuals   65901 2664219      40                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Extracting mean computer skills for each education level
mean_skills_Other = mean(subset(df_clean, EdLevel == "Other")$ComputerSkills)
mean_skills_NoHigherEd = mean(subset(df_clean, EdLevel == "NoHigherEd")$ComputerSkills)
mean_skills_Undergraduate = mean(subset(df_clean, EdLevel == "Undergraduate")$ComputerSkills)
mean_skills_Master = mean(subset(df_clean, EdLevel == "Master")$ComputerSkills)
mean_skills_PhD = mean(subset(df_clean, EdLevel == "PhD")$ComputerSkills)

The average number of computer skills for
Other: 14.1340349
No Higher Ed: 14.1333333
Undergraduate: 13.4622906
Master’s: 12.5334456
PhD: 9.9068132
Our ANOVA F test statistic is 281.8 with p-value <2e-16. At an alpha-level of .05, we reject the null hypothesis. We conclude that there is at least one education level for which the average number of computer skills is different from the other groups.

Since we have a highly significant p-value based on our ANOVA test above, we want to conduct a Post Hoc analysis of the test using the standard method, Tukey’s HSD test.

tukeyRes <- TukeyHSD(anovaRes)
tukeyRes
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = ComputerSkills ~ EdLevel, data = df_clean)
## 
## $EdLevel
##                                   diff        lwr        upr p adj
## NoHigherEd-Other         -0.0007015306 -0.3577307  0.3563277 1e+00
## Undergraduate-Other      -0.6717442374 -0.8738838 -0.4696047 0e+00
## Master-Other             -1.6005892711 -1.8229420 -1.3782365 0e+00
## PhD-Other                -4.2272216771 -4.6324364 -3.8220069 0e+00
## Undergraduate-NoHigherEd -0.6710427068 -0.9941250 -0.3479604 1e-07
## Master-NoHigherEd        -1.5998877405 -1.9359867 -1.2637888 0e+00
## PhD-NoHigherEd           -4.2265201465 -4.7037211 -3.7493192 0e+00
## Master-Undergraduate     -0.9288450337 -1.0911813 -0.7665088 0e+00
## PhD-Undergraduate        -3.5554774398 -3.9311253 -3.1798296 0e+00
## PhD-Master               -2.6266324060 -3.0135325 -2.2397323 0e+00

With the exception of the NoHigherEd-Other comparison (adjusted p ≈ 1, consistent with their nearly identical means), every pairwise adjusted p-value was below 0.05. We conclude that the average number of computer skills differs significantly between every other pair of education levels.
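To pull out just the significant pairs programmatically, the `TukeyHSD` result can be treated as a matrix and filtered on its `p adj` column. A small sketch on toy data (group names, means, and sizes are assumed, not the real dataset):

```r
# Sketch: filter TukeyHSD output for significant pairwise differences (toy data)
set.seed(1)
toy <- data.frame(
  EdLevel        = rep(c("NoHigherEd", "Undergraduate", "Master"), each = 30),
  ComputerSkills = c(rnorm(30, 14), rnorm(30, 13.5), rnorm(30, 12.5))
)
fit <- aov(ComputerSkills ~ EdLevel, data = toy)
tk  <- TukeyHSD(fit)$EdLevel                      # matrix: diff, lwr, upr, p adj
sig <- tk[tk[, "p adj"] < 0.05, , drop = FALSE]   # keep only significant pairs
rownames(sig)                                     # names of significant comparisons
```

Applied to `tukeyRes$EdLevel` above, this would return every pair except NoHigherEd-Other.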


In real-world terms, this suggests that the number of reported computer skills actually decreases as one progresses through higher levels of education: those with no higher education report about 14 skills on average, while PhD holders report about 10. For employers in the technical field, this could reflect specialization, with advanced degrees trading breadth of tools for depth in a narrower area. It is therefore essential to consider quality versus quantity: a longer skills list does not necessarily indicate deeper or more applicable knowledge for specific job roles.

Test 3: Is age a significant factor in determining whether an individual is a professional developer or not?

Now, since age is recorded as a categorical variable (<35 vs. >35) and the classification of professional developer is also categorical, we use a chi square test of independence to determine whether the two variables are associated.

The assumptions we need to satisfy are: - The data were drawn as a simple random sample
- All expected counts are greater than 5

\(H_0:\) Whether an individual is a professional developer or not is independent of their age. \(H_a:\) Age and status of being a professional developer are associated with each other.

# We need four categories: <35, NotDev; >35, NotDev; <35 Dev; >35 Dev
notDev_under35 <- subset(df_clean, df_clean$Age == "<35" & df_clean$MainBranch == "NotDev")
notDev_over35 <- subset(df_clean, df_clean$Age == ">35" & df_clean$MainBranch == "NotDev")
dev_under35 <- subset(df_clean, df_clean$Age == "<35" & df_clean$MainBranch == "Dev")
dev_over35 <- subset(df_clean, df_clean$Age == ">35" & df_clean$MainBranch == "Dev")

The contingency table that we perform our chi square test on is the following:

chisq_matrix <- matrix(c(nrow(dev_over35),nrow(dev_under35),nrow(notDev_over35),nrow(notDev_under35)), nrow = 2, ncol = 2)

# Now we conduct a chi square test of independence
chitest = chisq.test(chisq_matrix)
chitest
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  chisq_matrix
## X-squared = 393.1, df = 1, p-value < 2.2e-16

Our chi square test statistic is 393.1041904 with p-value 1.7462162 × 10^-87. At an alpha-level of .05 we reject the null hypothesis. We conclude that age and being a professional developer are associated.
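With a sample this large, statistical significance does not by itself imply a strong association. A hedged sketch of the phi effect size for a 2x2 table, computed from the reported statistic and an assumed total of 65,906 rows (inferred from the ANOVA degrees of freedom above, 65901 residual + 4 model + 1):

```r
# Effect size (phi) for a 2x2 chi-square test, from the reported numbers
x2 <- 393.1     # reported X-squared
n  <- 65906     # assumed number of rows in df_clean (see ANOVA table)
phi <- sqrt(x2 / n)
phi             # about 0.077
```

By common conventions a phi below 0.1 is a small effect, so while age and developer status are associated, the relationship appears weak.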

This data might imply that certain age groups are more inclined or better positioned to pursue developer careers. For educational institutions, this insight can guide curriculum planning, targeting age groups that are more likely to transition into professional development.

Test 4: Does mental health influence previous salary?

To answer our final question, we conduct a two sample t test to compare the mean salary between those with and without mental health issues. (We use a t test as opposed to its normal counterpart, a z test, even though we have a large sample size, since the population variance is unknown.)

The assumptions we need to satisfy are: - The observations are independent of one another
- The data in each group were taken from a simple random sample
- The data are relatively normally distributed
- The data are continuous in nature
- Homogeneity of variances
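Note that R’s `t.test` applies Welch’s correction by default, which relaxes the homogeneity-of-variances assumption. A self-contained sketch on simulated salaries (the group means mirror those reported below; the group sizes and the spread of 40,000 are assumptions):

```r
# Toy sketch of a Welch two-sample t test on simulated salaries
set.seed(42)
salary_yes_sim <- rnorm(5000,  mean = 72709, sd = 40000)  # 'Yes' group (assumed n, sd)
salary_no_sim  <- rnorm(18000, mean = 64652, sd = 40000)  # 'No' group (assumed n, sd)
tt <- t.test(salary_yes_sim, salary_no_sim)  # Welch by default: var.equal = FALSE
tt$p.value   # far below .05 for a mean gap this large
```

Passing `var.equal = TRUE` would instead run the pooled-variance Student t test.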

\(H_0:\) There is no difference in mean previous salary between individuals who report mental health issues and those who do not. \(H_a:\) There is a significant difference in mean previous salary between the two groups.

# Subset the PreviousSalary for individuals with 'Yes' and 'No' in the MentalHealth column
salary_yes <- df_clean$PreviousSalary[df_clean$MentalHealth == "Yes"]
salary_no <- df_clean$PreviousSalary[df_clean$MentalHealth == "No"]

# Checking the mean of both groups just to get an idea
format(mean(salary_yes, na.rm = TRUE), digits=5)
## [1] "72709"
format(mean(salary_no, na.rm = TRUE), digits=5)
## [1] "64652"
# Now we conduct a two-sample t-test on PreviousSalary between the two groups
ttest_2sample_salary_yes_no <- t.test(salary_yes, salary_no)
ttest_2sample_salary_yes_no
## 
##  Welch Two Sample t-test
## 
## data:  salary_yes and salary_no
## t = 17.244, df = 23058, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  7141.483 8973.164
## sample estimates:
## mean of x mean of y 
##  72709.47  64652.15

The average salary for those who responded ‘yes’ to having mental health issues: $72,709.47
‘no’ to having mental health issues: $64,652.15
Our t test statistic is 17.2441529 with p-value 3.2238049 × 10^-66. At an alpha-level of .05 we reject the null hypothesis. We conclude that the mean salary of those with mental health issues is significantly different from (in this case, higher than) that of those without.

This shows that those acknowledging mental health issues earned, on average, between $7,141 and $8,973 more (95% confidence interval) than those who didn’t. One possible explanation is that higher salaries carry added responsibilities, leading to stress and potential mental health challenges, though this association alone cannot establish causation. Companies and HR teams can use this insight to monitor employee well-being. Job seekers should also be aware that higher salaries might come with increased mental health considerations, aiding them in making informed decisions.

6. Conclusions/Key Takeaways

7. Limitations

The dataset mainly covers the tech industry and, since it is drawn from Stack Overflow, skews heavily toward men, so it might not fully represent the wider job market. It would have been better if we had details on soft skills, cultural fit, and specific job roles. Without currency details for ‘PreviousSalary’, it’s hard to compare salaries across countries. Also, since age is grouped rather than exact, it’s less granular.

8. Further Research

For the future, we have two main suggestions. First, we recommend doing an EEPEN analysis to look more closely at specific factors like age or country, which would give clearer insight into those areas. Second, we suggest using the data to build models that predict whether someone will be employed; as we gather more data, these models can be refined.